I have implemented an MPI program in C with \(2P\) messages passing among \(P\) processes on a ring topology with periodic boundaries.
The program implements a stream of messages of \(4\,B\) each, both clockwise and anticlockwise: each process sends and receives messages of type int (the \(rank\)) with a tag proportional to the rank (\(tag=rank*10\)).
When run on \(P\) processes, the program writes to the folder ./out a file ‘np\(P\).txt’ with the following output, ordered by rank (here \(P=5\)):
I am process 0 and I have received 5 messages. My final messages have tag 0 and value msg-left -10, msg-right 10
I am process 1 and I have received 5 messages. My final messages have tag 10 and value msg-left -10, msg-right 10
I am process 2 and I have received 5 messages. My final messages have tag 20 and value msg-left -10, msg-right 10
I am process 3 and I have received 5 messages. My final messages have tag 30 and value msg-left -10, msg-right 10
I am process 4 and I have received 5 messages. My final messages have tag 40 and value msg-left -10, msg-right 10
I have used the following routines for message passing among processes:
- MPI_Isend: non-blocking send routine
- MPI_Irecv: non-blocking receive routine
- MPI_Wait: waits for a non-blocking operation to complete

Using the latest version of OpenMPI available on ORFEO (openmpi-4.1.1+gnu-9.3.0), I have run the program with up to \(P=24\) processes on a thin node with the InfiniBand network and native protocol.
I have recorded the runtime (from the first send/receive to the last) when varying the number of processes \(P\) by means of the routine MPI_Wtime(), which returns an elapsed time on the calling process in seconds.
For each run with \(P\) processes and for each process, I have taken the mean runtime in microseconds over 10000 iterations and printed it out (along with some statistics), ordered by rank, to the file ‘np\(P\).csv’ in the folder ./out, as follows:
| rk | P | t_mean (\(\mu s\)) | t_xmsg (\(\mu s\)) | var | s2 | N | dev | s |
|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 7.073931 | 1.414786 | 1425.467 | 1425.496 | 10000 | 37.75536 | 37.75574 |
| 1 | 5 | 7.072714 | 1.414543 | 1423.095 | 1423.123 | 10000 | 37.72393 | 37.72430 |
| 2 | 5 | 7.073540 | 1.414708 | 1428.383 | 1428.412 | 10000 | 37.79395 | 37.79433 |
| 3 | 5 | 7.072797 | 1.414559 | 1419.479 | 1419.507 | 10000 | 37.67598 | 37.67635 |
| 4 | 5 | 7.074060 | 1.414812 | 1430.666 | 1430.694 | 10000 | 37.82414 | 37.82452 |
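Putting together the routines and the timing described above, the following is a minimal sketch (my own reconstruction, not the original source) of the measured loop: each process posts a non-blocking send/receive pair towards each neighbour, waits for completion, and MPI_Wtime() accumulates the mean per-iteration time in microseconds. The actual program, of course, forwards the messages around the whole ring rather than performing a single hop per iteration.

```c
/* Sketch: one timed iteration of the bidirectional ring exchange.
 * Each process sends its rank (a 4 B int) to both neighbours with tag = rank*10
 * and receives the corresponding messages, using MPI_Isend/MPI_Irecv/MPI_Wait. */
#include <mpi.h>
#include <stdio.h>

#define N_ITER 10000

int main(int argc, char *argv[]) {
    int rank, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int left  = (rank - 1 + P) % P;      /* periodic boundaries */
    int right = (rank + 1) % P;
    int tag   = rank * 10;               /* tag proportional to the rank */

    int sendbuf = rank, msg_left = 0, msg_right = 0;
    double t_sum = 0.0;

    for (int it = 0; it < N_ITER; ++it) {
        double t0 = MPI_Wtime();         /* seconds, on the calling process */

        MPI_Request req[4];
        MPI_Irecv(&msg_left,  1, MPI_INT, left,  left  * 10, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&msg_right, 1, MPI_INT, right, right * 10, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&sendbuf,   1, MPI_INT, right, tag, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&sendbuf,   1, MPI_INT, left,  tag, MPI_COMM_WORLD, &req[3]);
        for (int i = 0; i < 4; ++i)
            MPI_Wait(&req[i], MPI_STATUS_IGNORE);

        t_sum += (MPI_Wtime() - t0) * 1.0e6;   /* accumulate microseconds */
    }

    double t_mean = t_sum / N_ITER;      /* per-rank value reported in npP.csv */
    printf("rank %d: t_mean = %f us\n", rank, t_mean);

    MPI_Finalize();
    return 0;
}
```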
Then, taking for each run with a different \(P\) only the maximum value of t_mean among all processes as the measure of performance, I was able to produce the following plot, running the program with different mappings:
Because of the routines used in the program, I expect my data to comply with a \(P\) double PingPing model.
By means of the IMB-MPI1 PingPing benchmark it is possible to measure the startup \(\Delta t\) and throughput \(\frac{X}{\Delta t}\) of single messages of size \(X\) that are obstructed by oncoming messages. To achieve this, two processes communicate with each other using MPI_Isend/MPI_Recv/MPI_Wait calls, similarly to my program, as follows:
Fig.1 PingPing Pattern from Intel® MPI Benchmarks User Guide and Ring with P=3.
With the IMB-MPI1 PingPing benchmark I was able to estimate the latency \(\lambda_{net}\) and bandwidth \(b_{net}\) of the InfiniBand network with different process mappings: across nodes, sockets and cores. Since my \(t_{mean}\) is the time in \(\mu s\) taken by a pair of opposite messages to pass through all \(P\) processes and return to the original one, \(t_{xmsg}=\frac{t_{mean}}{P}\) is the per-hop time, of which the \(\Delta t\) measured by the PingPing benchmark accounts for half (i.e. \(t_{xmsg}\approx 2\Delta t\)).
Thus, inverting the previous relation into \(t_{mean}=t_{xmsg}\,P\), the theoretical model becomes \(t_{theo}=2\,\Delta t_{ping}\,P\), where \(\Delta t_{ping} = \lambda_{ping} + \frac{4B}{b_{ping}}\), with \(\lambda_{ping}\) and \(b_{ping}\) estimated by the least squares method on the PingPing benchmark results.
Hence, the following communication model has been used for plotting my data against the PingPing model: \[t_{theo}(P)=2P\left(\lambda_{ping} + \frac{4B}{b_{ping}}\right)\]
Note: the estimated bandwidths of the network for the different mappings are:
As expected, the measured bandwidth for the 100 Gbit InfiniBand network (message passing among nodes) reaches up to approximately 12000 MB/s (\(= 12\) GB/s \(= 12 \cdot 8\) Gbit/s \(= 96\) Gbit/s), which is \(96\%\) of the theoretical peak performance. In any case, this term of the model can safely be omitted, since it is of the order of \(10^{-10}\) s.
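Indeed, for a \(4\,B\) message: \[\frac{4\,B}{b_{node}} \approx \frac{4\,\mathrm{B}}{12\times10^{9}\,\mathrm{B/s}} \approx 3\times10^{-10}\,\mathrm{s}\]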
The latency between two nodes \(\lambda_{node}\) estimated with the PingPing benchmark is somewhat lower than the one declared by the manufacturer of the Mellanox switch (\(1.35\) \(\mu s\)) used in the InfiniBand network: this seems to hold when few processes are communicating.
I was indeed expecting the theoretical performance (evaluated with the PingPing benchmark between just 2 processes) to be better than the real scenario in which all cores/sockets in the nodes are used simultaneously; and in fact, as can be seen from the previous plot, my model seems to work well for the core and socket mappings, but quite badly for the node mapping.
Data from the node mapping seem to follow another model, half the expected one: \(t(P)=P(\lambda_{mlx5} + \frac{4B}{b_{node}})\), where \(\lambda_{mlx5}=1.35\) \(\mu s\) is precisely the latency of the Mellanox switch. This may suggest that the switch is able to send 2 messages of \(4B\) simultaneously, one in each direction of the network, halving the time for message passing between processes placed on different nodes.
The Intel MPI IMB-MPI1 PingPong benchmark has been used to estimate the latency \(\lambda_{net}\) and bandwidth \(b_{net}\) of all available combinations of topologies and networks on the ORFEO computational nodes, using the latest available versions of both the IntelMPI and OpenMPI libraries.
Let’s start by looking at ORFEO computational nodes and resources:
[valinsogna@login 2021Assignement01]$ pbsnodes -ajS
mem ncpus nmics ngpus
vnode state njobs run susp f/t f/t f/t f/t jobs
--------------- --------------- ------ ----- ------ ------------ ------- ------- ------- -------
ct1pf-fnode001 job-busy 1 1 0 560gb/1tb 0/36 0/0 0/0 56793
ct1pf-fnode002 free 1 1 0 1tb/1tb 12/36 0/0 0/0 55267
ct1pt-tnode001 job-busy 1 1 0 54gb/754gb 0/24 0/0 0/0 56845
ct1pt-tnode002 job-busy 1 1 0 54gb/754gb 0/24 0/0 0/0 56846
ct1pt-tnode004 job-busy 1 1 0 54gb/754gb 0/24 0/0 0/0 56847
ct1pt-tnode005 job-busy 1 1 0 54gb/754gb 0/24 0/0 0/0 56794
ct1pt-tnode006 free 0 0 0 754gb/754gb 24/24 0/0 0/0 --
ct1pt-tnode007 job-busy 1 1 0 54gb/754gb 0/24 0/0 0/0 56795
ct1pt-tnode008 job-busy 1 1 0 54gb/754gb 0/24 0/0 0/0 56848
ct1pt-tnode009 job-busy 1 1 0 754gb/754gb 0/24 0/0 0/0 57073
ct1pt-tnode010 job-busy 1 1 0 754gb/754gb 0/24 0/0 0/0 57073
ct1pt-tnode003 offline 0 0 0 754gb/754gb 24/24 0/0 0/0 --
ct1pg-gnode001 free 1 1 0 252gb/252gb 44/48 0/0 0/0 57057
ct1pg-gnode002 free 1 1 0 252gb/252gb 24/48 0/0 0/0 56354
ct1pg-gnode003 free 1 1 0 252gb/252gb 24/48 0/0 0/0 56862
ct1pg-gnode004 free 0 0 0 252gb/252gb 48/48 0/0 0/0 --
On ORFEO there are:
The MPI point to point performance has been measured only on thin and gpu nodes.
Now we can look at the node topology for the thin and gpu nodes. This can be done on the selected node either with module load likwid followed by likwid-topology, or with module load hwloc and lstopo. The figure below shows the lstopo output on a thin node, which is more or less the same as for a gpu node.
Fig.2 Topology on thin node
The main differences between a gpu and a thin node are the hyper-threading enabled on the former, the different RAM size and the different CPU models:
As can be seen by typing lscpu:
[valinsogna@ct1pt-tnode008 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
Stepping: 4
CPU MHz: 3299.999
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 5200.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
[valinsogna@ct1pg-gnode001 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz
Stepping: 7
CPU MHz: 3499.999
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
BogoMIPS: 5400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
The main differences between these two are the use of the Sockets interface and the TCP/IP protocol for the Ethernet network, and the use of the much faster OpenFabrics Verbs (no kernel stack) with the native IB protocol for the InfiniBand network. Moreover, the IB protocol supports RDMA (Remote Direct Memory Access), which makes InfiniBand faster: it is an operation that accesses memory directly without involving the CPU. We thus expect the InfiniBand network with the native IB protocol to be much faster (low \(\lambda_{net}\), high \(b_{net}\)) than the Ethernet network. Moreover, I am also going to test the InfiniBand network performance with the plain IP protocol: IPoIB (IP-over-InfiniBand) is the protocol that defines how to send IP packets over IB, passing through kernel space.
As can be seen from Fig.2, the network cards of each node are attached to only one of the two NUMA domains, and there are several PCI (Peripheral Component Interconnect) devices that can be seen by typing ifconfig:
[valinsogna@ct1pt-tnode007 ~]$ ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
inet6 fe80::3680:dff:fe4e:5568 prefixlen 64 scopeid 0x20<link>
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.128.2.127 netmask 255.255.255.0 broadcast 10.128.2.255
inet6 fe80::3680:dff:fe4e:5568 prefixlen 64 scopeid 0x20<link>
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
em1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
em2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 10.128.6.127 netmask 255.255.255.0 broadcast 10.128.6.255
inet6 fe80::ba59:9f03:d4:27d6 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 00:00:09:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
As seen above, em1 and em2 are two distinct physical Ethernet cards that refer to the single interface br0, whilst ib0 is the InfiniBand interface. More details are shown below with lstopo:
[valinsogna@ct1pt-tnode007 ~]$ lstopo
Machine (754GB total)
Package L#0
NUMANode L#0 (P#0 376GB)
L3 L#0 (19MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
...
HostBridge
PCI 00:11.5 (SATA)
PCI 00:17.0 (SATA)
PCIBridge
PCIBridge
PCI 03:00.0 (VGA)
HostBridge
PCIBridge
PCI 18:00.0 (RAID)
Block(Disk) "sda"
PCIBridge
PCI 19:00.0 (Ethernet)
Net "em1"
PCI 19:00.1 (Ethernet)
Net "em2"
HostBridge
PCIBridge
PCI 3b:00.0 (InfiniBand)
Net "ib0"
OpenFabrics "mlx5_0"
With the OpenMPI implementation and UCX it is possible to directly select the device (using the UCX_NET_DEVICES specification in the run command), which as a consequence determines the protocol used. The devices tested with OpenMPI across nodes correspond to the TCP/IP, IPoIB and native IB protocols.
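For instance, the device selection might look like the following command lines (illustrative only; the IMB-MPI1 executable name and the mapping option are assumptions on my part):

```
mpirun -np 2 --map-by node --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 PingPong   # native IB (RDMA)
mpirun -np 2 --map-by node --mca pml ucx -x UCX_NET_DEVICES=ib0      ./IMB-MPI1 PingPong   # IPoIB
mpirun -np 2 --map-by node --mca pml ucx -x UCX_NET_DEVICES=br0      ./IMB-MPI1 PingPong   # TCP/IP (Ethernet)
mpirun -np 2 --map-by node --mca pml ob1 --mca btl tcp,self          ./IMB-MPI1 PingPong   # OB1 with tcp btl
```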
The Intel MPI benchmark IMB-MPI1 PingPong measures message passing between two processes and works as follows:
PingPong Pattern from Intel® MPI Benchmarks User Guide
It reports the time \(\Delta t/2\) (in \(\mu s\)), throughput \(\frac{2X}{\Delta t}\) (in MB/s), number of repetitions and the message size \(X\) (in B) in the standard output:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
            0         1000         0.20         0.00
            1         1000         0.21         4.90
            2         1000         0.20        10.07
            4         1000         0.22        20.26
            8         1000         0.24        40.59
The bandwidth and the latency estimation for gpu and thin nodes is done across cores, sockets and nodes, with different protocols, PCI devices and libraries.
The nodes involved are:
Each specific run was repeated 10 times with -msglog 28. For the thin nodes, the entire node was reserved each time in order to reduce noise in the measurements; for the gpu nodes this was not possible due to queue traffic. From the measurements, the mean time \(\Delta t/2\) and the mean throughput \(2X/\Delta t\) over the 10 results were taken for each message size \(X\), along with some statistics (e.g. the maximum error on time and on throughput, \(e = \frac{t_{max} - t_{min}}{2}\)).
The PingPong benchmark is compiled on either the thin or gpu node using two different libraries:
- the latest openmpi-4.1.1, using the UCX_NET_DEVICES specification for the different devices;
- the latest intel library available on ORFEO.

With the OpenMPI implementation it is possible to specify the MPI frameworks and plugins (like pml and btl) to be used while running. Across nodes, cores and sockets the following have been tested:
The following graphs show the results for both thin and gpu nodes, plotting the benchmark throughput \(2X/\Delta t\) against the message size \(X\) with different mappings of the processes: across two different nodes in the first plot, and across two sockets or within the same socket in the latter.
Along with the results of the ‘across nodes’ mapping, lines for the 100 Gbit InfiniBand and the 25 Gbit Ethernet have been drawn, in order to compare the theoretical peak performances with the measured ones.
Moreover, in the plot of the ‘by socket/core’ mapping, cache size lines have been drawn too, in order to discuss the general behavior.
From this first plot it is evident that with Intel MPI, UCX OpenMPI InfiniBand (the default command) and UCX native IB with the mlx5_0:1 device (which is the same as the previous one), the asymptotic bandwidth is very high compared to the other protocols/devices and close to the expected theoretical peak: my data reach roughly \(97\%\) of the theoretical limit (see eq (2)).
UCX with br0 and OB1 with tcp reach a real maximum performance of about 86% and 89%, respectively, of the theoretical bandwidth: a good result if we take into account that the tcp protocol is heavier than the native IB one (encoding, no RDMA available). Their latencies are comparable (see table in the ‘Results’ section), but OB1 has a higher bandwidth.
IPoIB (aka UCX ib0) happens to have a good latency (see table in the ‘Results’ section), but its bandwidth is not as high and is in fact comparable to that of the 25 Gbit Ethernet network.
Now let’s look at the performance with the across core and socket configuration.
At first glance it is clear that these performances are lower in bandwidth but better in latency (see table in the ‘Results’ section) when compared to the across-nodes configuration. Moreover, mapping the processes in the same socket often shows better performance than mapping them in separate sockets.
The drop in bandwidth before the asymptotic throughput plateau is common to both configurations (by socket, by node), and it shows very different performance among the different implementations. This pattern occurs between 32 KB and 16 MB inclusive: after that size, measurements become stable.
By plotting the cache lines it is clear that this behavior must be related to the cache: when the size of the PingPong message becomes too large to fit in L1, there is an L1 miss and data are retrieved from the L2 cache. The same happens for L2 as the size grows, so there is a visible drop in the throughput. The L1 effect is not clearly visible from the bandwidth (probably due to the latency), whereas for L2 misses, after \(1\) MB, all implementations start losing bandwidth. In the plateau area, instead, the message size is larger than all caches and a stable bandwidth is reached.
Now let’s compare protocols. Firstly, UCX IB shows poor performance below 131 MB with respect to the other protocols, and after that size there is a huge increase. Intel InfiniBand, this time, also shows poor performance with respect to the UCX implementation, which might be related to it suffering large cache misses. The OB1 implementation seems to perform better with the vader btl rather than with tcp, as expected, since vader is designed for shared-memory transfers.
Now let’s have a look at the GPU results in terms of throughput.
As can be inferred from the previous plot, gpu nodes behave like thin nodes and no major difference pops out, but the gpu node is a bit slower (about 15%) than the thin node, as already found in exercise 1.
This can be caused either by the fact that it was not possible to reserve a whole node (let alone 2 in the across-node configuration) or, more likely, by the different CPU frequencies and node configurations: remember that even if the gpu CPU has a higher nominal frequency (2.70 GHz) than the thin node one (2.60 GHz), these frequencies are much lower when several cores run simultaneously.
Cache sizes are the same as on the thin node, so cache effects are similar, as shown here:
Here are the estimates, obtained via the least squares method, of the latency \(\lambda_{net}\) and bandwidth \(b_{net}\) of the networks used. For the fit model estimation, see the folder section2/mydata/csv-files/.
Latency and bandwidth fit estimation
It is a prototype for a stencil-based iterative method used in numerical analysis to solve partial differential equations.
In its most straightforward form, it can be used for solving the diffusion equation for a scalar function \(\Phi (r,t)\):
\[\frac{\delta \Phi}{\delta t} = \Delta \Phi\] on a rectangular lattice subject to Dirichlet boundary conditions. The differential operators are discretized using finite differences: \[\frac{\delta \Phi (x_i, y_i)}{\delta t} = \frac{\Phi (x_{i+1}, y_i)+\Phi (x_{i-1}, y_i)-2\Phi (x_i, y_i)}{\delta x^2} + \frac{\Phi (x_i, y_{i+1})+\Phi (x_i, y_{i-1})-2\Phi (x_i, y_i)}{\delta y^2}\] In each time step, a correction \(\delta \Phi\) to \(\Phi\) at coordinate \((x_i,y_i)\) is calculated using the “old” values from the four nearest neighbor points. Of course, the updated \(\Phi\) values must be written to a second array. After all points have been updated (a “sweep”), the algorithm is repeated until convergence, bounded by a threshold \(eps\).
The following is a 2D implementation of the Jacobi algorithm on an N × N lattice, with a convergence criterion added, taken from G. Hager and G. Wellein, “Introduction to High Performance Computing for Scientists and Engineers”:
double precision, dimension(0:N+1,0:N+1,0:1) :: phi
double precision :: maxdelta,eps
integer :: i,k,t0,t1
eps = 1.d-14          ! convergence threshold
t0 = 0 ; t1 = 1
maxdelta = 2.d0*eps
do while(maxdelta.gt.eps)
   maxdelta = 0.d0
   do k = 1,N
      do i = 1,N
         ! four-point stencil update, written to the "new" array (index t1)
         phi(i,k,t1) = ( phi(i+1,k,t0) + phi(i-1,k,t0)   &
                       + phi(i,k+1,t0) + phi(i,k-1,t0) ) * 0.25
         maxdelta = max(maxdelta,abs(phi(i,k,t1)-phi(i,k,t0)))
      enddo
   enddo
   ! swap arrays
   i = t0 ; t0 = t1 ; t1 = i
enddo
In a parallel computation, the core of the code implementation is similar but:
For this exercise, we used as Jacobi solver a 3D program written in FORTRAN, provided by professor Cozzini.
In order to predict the 3D Jacobi performance \(P\), the following model has been used: \[\begin{equation} \tag{*} P(L,N)=\frac{L^3N}{T_s(L)+T_c(L,N)}\space \bigg[ \frac{MLUP}{s} \bigg] \end{equation}\] where:
The latter is modeled by simply estimating the latency \(\lambda\) and bandwidth \(b\) of the network and the message size \(c\).
In a 3D problem, the number of points used to update a point of the subdomain is not 4 (5-point stencil) but 6 (7-point stencil). When processes communicate with each other, they issue all 6 point-to-point communications simultaneously, so we need to consider full-duplex data transfer for the bandwidth.
Moreover, we assume that the time for copying the halo to/from an intermediate buffer and the communication of a process with itself come at no cost. \(T_c\) can then be modeled as:
\[\begin{equation} \tag{**} T_c(L,N)=\frac{c(L)}{b}+k\lambda\space [s] \end{equation}\]
where:
Then \(c\) is: \[\begin{equation} \tag{***} c(L)=L^2k\cdot2\cdot8\space [B] \end{equation}\]
where:
Since we use a 3D Jacobi FORTRAN program (column-major order), there are domain decompositions that are more suitable than others when dealing with contiguous memory locations for the halo exchange.
In order to find the best possible decomposition, several tests have been done to find the best-performing permutation of (\(N_x, N_y, N_z\)).
With \(N=6\), take the \((6,1,1)\), \((1,6,1)\) and \((1,1,6)\) decompositions with \(L=1200\) on a thin node: experimentally, the first configuration behaves better than the others in terms of MLUP/s (note: large values of \(L\) are needed to see this effect). Example:
| Mapping | Nx | Ny | Nz | MLUP/s |
|---|---|---|---|---|
| core | 6 | 1 | 1 | 668 |
| core | 1 | 6 | 1 | 665 |
| core | 1 | 1 | 6 | 665 |
| socket | 6 | 1 | 1 | 674 |
| socket | 1 | 6 | 1 | 671 |
| socket | 1 | 1 | 6 | 670 |
Instead, when comparing, again with \(N=6\), \((6,1,1)\) and \((3,2,1)\): the former involves a higher buffering cost but the latter requires more halo exchanges, so both configurations will be taken into account.
As a first step I have compiled and run the code on a single processor of a thin and of a gpu node, to estimate the serial time on one single core \(T_s\).
I have used the OpenMPI implementation on ORFEO both on thin and gpu nodes, using the native IB protocol.
The size of the subdomain was fixed at \(L=700\) and used for both gpu and thin measurements since, due to the RAM size on the gpu node (see pbsnodes -ajS in Exercise 2), it corresponds to the maximum amount of memory available for a comparison between the two different nodes.
The Jacobi program was run on:
The jacobi3D program prints out the performance in \(MLUP/s\) as the last column for each run: I have taken the mean of the 10 results. Along with the performance measurements, I took the elapsed time by means of the /usr/bin/time -f "user : %U system: %S elapsed: %e CPU: %P CMD: %C \n" run command.
The following is an example of running with \(L=700\) and \(N=(4,1,1)\):
# Threadlevel provided: 3 out of 0 1 2 3
spat_dim , proc_dim, PBC ?
1 -Dim Input 2800 4 T
2 -Dim Input 700 1 T
3 -Dim Input 700 1 T
# 0 Process grid 4 1 1
0 3 2
2 3 2
3 3 2
1 3 2
# 0 Allocated 2 arrays with a total of 5535.17 Mbytes
# StartResidual 0.00000000000
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06243716300 3.06242734200 2.98796024800 2.99414415400
Residual 12273333.3238 MLUPs 448.009192344
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06078858000 3.06078676400 2.98838307400 2.99486265400
Residual 2041666.66667 MLUPs 448.250496282
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06090752400 3.06090532700 2.98793178100 2.99505510000
Residual 434799.382550 MLUPs 448.233077688
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06095547000 3.06095308700 2.98808474500 2.99490125800
Residual 129179.526716 MLUPs 448.226056683
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06516349800 3.06516215100 2.99001243700 2.99797766000
Residual 52905.8427676 MLUPs 447.610706866
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06256750200 3.06256510400 2.98815335100 2.99561585600
Residual 27258.4295825 MLUPs 447.990125639
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06197104500 3.06196926100 2.98880531000 2.99486067300
Residual 16225.1238619 MLUPs 448.077391927
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06272575000 3.06272477100 2.98845891200 2.99545644500
Residual 10594.4563848 MLUPs 447.966978434
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06178146800 3.06177916200 2.98808213100 2.99487845400
Residual 7372.72712122 MLUPs 448.105135634
4 Maxtime , Mintime + JacobiMi , JacobiMa 3.06086679800 3.06086465500 2.98821124300 2.99498761300
Residual 5375.90215905 MLUPs 448.239041600
The following tables show the results of performance from relations \((*)\), \((**)\) and \((***)\) for both thin and gpu node with different mapping of the processes.
For evaluating \(T_c\), the bidirectional bandwidth \(b\) and the latency \(\lambda\) considered are taken from the Intel® IMB-MPI1 PingPing benchmark.
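As a worked illustration of how relations \((*)\), \((**)\) and \((***)\) are combined, the small sketch below computes the modeled performance; the numerical values of \(T_s\), \(\lambda\) and \(b\) are placeholders (not the measured ORFEO figures), and \(k\) is taken as the number of halo faces exchanged.

```c
/* Sketch of the model P(L,N) = L^3*N / (T_s(L) + T_c(L,N)),
 * with T_c(L,N) = c(L)/b + k*lambda and c(L) = L^2 * k * 2 * 8 [B].
 * All numerical values below are placeholders, not measured figures. */
#include <stdio.h>

int main(void) {
    const double L      = 700.0;    /* subdomain edge                                     */
    const int    N      = 12;       /* number of processes                                */
    const int    k      = 6;        /* halo faces exchanged (7-point stencil)             */
    const double Ts     = 3.06;     /* serial time for one sweep of L^3 [s], placeholder  */
    const double lambda = 1.35e-6;  /* network latency [s], placeholder                   */
    const double b      = 12.0e9;   /* bidirectional bandwidth [B/s], placeholder         */

    double c  = L * L * k * 2.0 * 8.0;                 /* halo traffic per sweep [B]   */
    double Tc = c / b + k * lambda;                    /* communication time [s]       */
    double P  = (L * L * L * N) / (Ts + Tc) / 1.0e6;   /* modeled performance [MLUP/s] */

    printf("c = %.3e B, Tc = %.3e s, P = %.1f MLUP/s\n", c, Tc, P);
    return 0;
}
```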
The model seems quite accurate for the thin node: our data differ by about 3% from the model, except for the node mapping with 12 processes, where the difference rises up to 5%.
Mapping by socket leads to slightly better performance than core mapping, which is not what the theoretical model predicts. Socket mapping leads to poorer communication performance, but it exploits the memory better: when one socket is filled with 12 processes, the entire grid is stored in the memory attached to that socket, whereas spreading the processes across the 2 sockets exploits the bandwidth better, using both memory controllers and therefore all available memory channels. At each iteration this solver works through a huge amount of memory: given \(L=700\) and \(12\) workers, at least 250 GB of RAM are processed and more than one third of the memory is saturated.
Also with the map-by node option we experimentally obtain better results when several processes are involved.
For the gpu node, the model does not seem as accurate: data in the socket mapping differ by about 5%, 10% and 33% (using 12, 24 and 48 processes respectively) from the model, whilst in the core mapping they differ by about 10%, 10% and 33%.
On the gpu node the differences between socket and core mapping are even bigger: again, we can guess that memory is the main cause. The gpu node setup, even if it has a higher CPU frequency, does not exploit all the possible memory channels. When the number of workers grows beyond the number of physical cores, the real performance, due to hyper-threading, drops dramatically with respect to the model. The execution time also shows poor performance with 48 workers.
Since a 3D array in memory can be represented as a single linear array, the most effective way to sum it in parallel is to use collective operations.
The shape of the array does not make any difference, and neither do the virtual topology and its relative domain decomposition. We can assume that the problem is not sensitive to any topology, since there is no need for communication between neighbors. The most efficient way is to use the MPI_Scatterv and MPI_Gatherv collective routines.
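A minimal sketch of this approach (my own illustration, with hypothetical buffer names; the matrix size is the one used in the test below): the root scatters equal chunks of the two flattened arrays, each process sums its chunk, and the result is gathered back.

```c
/* Sketch: element-wise sum of two 3D arrays treated as flat linear buffers,
 * distributed with MPI_Scatterv and collected with MPI_Gatherv.
 * Buffer names and the chunking scheme are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long ntot = 2400L * 1000L * 700L;            /* total number of elements */

    /* per-process counts and displacements; the first ranks absorb the remainder */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    long base = ntot / size, rem = ntot % size, off = 0;
    for (int i = 0; i < size; ++i) {
        counts[i] = (int)(base + (i < rem ? 1 : 0));
        displs[i] = (int)off;
        off += counts[i];
    }

    double *A = NULL, *B = NULL, *C = NULL;
    if (rank == 0) {          /* root owns the full arrays (initialisation omitted) */
        A = malloc(ntot * sizeof(double));
        B = malloc(ntot * sizeof(double));
        C = malloc(ntot * sizeof(double));
    }
    double *a = malloc(counts[rank] * sizeof(double));
    double *b = malloc(counts[rank] * sizeof(double));
    double *c = malloc(counts[rank] * sizeof(double));

    /* distribute the chunks, sum locally, collect the result on the root */
    MPI_Scatterv(A, counts, displs, MPI_DOUBLE, a, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatterv(B, counts, displs, MPI_DOUBLE, b, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < counts[rank]; ++i)
        c[i] = a[i] + b[i];
    MPI_Gatherv(c, counts[rank], MPI_DOUBLE, C, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();           /* frees omitted for brevity */
    return 0;
}
```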
The expectations for this code are not very high in terms of scalability: summing two matrices of \(n\) elements each requires \(2n\) memory reads, \(n\) memory writes, a lot of communication and only \(n\) floating point operations. The parallel portion of the code is therefore small with respect to the serial one. Using Amdahl’s law we can predict the speedup bound.
Testing the parallel code with two \(2400\times1000\times700\) matrices filled with doubles and using only one processor, we can see that the parallel part takes only \(2.63\) seconds out of a total runtime of \(49\) seconds, representing about \(5\%\) of the execution time (the elapsed time is measured using /usr/bin/time, while Scatter and Gather are detailed with MPI_Wtime()). The total execution time takes into account matrix initialization and error checking.
This graph then represents the theoretical maximum speedup given by Amdahl’s law, assuming that roughly only \(5\%\) of the code (communication) is parallel, with the serial remainder fixed with respect to the number of processes. Experimentally this is a good approximation and captures the trend in the \([1,24]\) process range.
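Explicitly, with a parallel fraction \(f \approx 0.05\), Amdahl’s law bounds the speedup on \(p\) processes as: \[S(p) = \frac{1}{(1-f) + \frac{f}{p}} \;\leq\; \frac{1}{1-f} \approx 1.05\]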